From Raw Features to Effective Embeddings: A Three-Stage Approach for Multimodal Recipe Recommendation
Shin, Jeeho, Kim, Kyungho, Shin, Kijung
Recipe recommendation has become an essential task in web-based food platforms. A central challenge is effectively leveraging rich multimodal features beyond user-recipe interactions. Our analysis shows that even simple uses of multimodal signals yield competitive performance, suggesting that systematic enhancement of these signals is highly promising. We propose TESMR, a three-stage framework for recipe recommendation that progressively refines raw multimodal features into effective embeddings through: (1) content-based enhancement using foundation models with multimodal comprehension, (2) relation-based enhancement via message propagation over user-recipe interactions, and (3) learning-based enhancement through contrastive learning with learnable embeddings. Experiments on two real-world datasets show that TESMR outperforms existing methods, achieving 7-15% higher Recall@10.
- Asia > South Korea > Daejeon > Daejeon (0.40)
- Asia > South Korea > Seoul > Seoul (0.05)
- Research Report (0.64)
- Overview (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
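A minimal sketch of the idea behind stage (2), relation-based enhancement: one round of mean-pooling message propagation over user-recipe interactions. All function names, embeddings, and the choice of mean aggregation here are illustrative assumptions, not taken from the paper.

```python
def propagate(user_emb, recipe_emb, interactions):
    """One propagation step over a bipartite user-recipe graph:
    each user averages the embeddings of recipes they interacted with,
    and each recipe averages the embeddings of its users."""
    def mean(vectors, fallback):
        if not vectors:
            return fallback  # isolated node keeps its current embedding
        dim = len(fallback)
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    new_user = {
        u: mean([recipe_emb[r] for r in recipes], user_emb[u])
        for u, recipes in interactions.items()
    }
    # Invert the interaction map to propagate back to recipes.
    by_recipe = {}
    for u, recipes in interactions.items():
        for r in recipes:
            by_recipe.setdefault(r, []).append(user_emb[u])
    new_recipe = {
        r: mean(by_recipe.get(r, []), emb) for r, emb in recipe_emb.items()
    }
    return new_user, new_recipe

users = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
recipes = {"r1": [0.5, 0.5], "r2": [1.0, 1.0]}
edges = {"u1": ["r1"], "u2": ["r1", "r2"]}
new_u, new_r = propagate(users, recipes, edges)
```

Stacking several such steps would mix in higher-order neighborhood signal before the contrastive stage (3) refines the result.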
M2R2: MultiModal Robotic Representation for Temporal Action Segmentation
Sliwowski, Daniel, Lee, Dongheui
Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.
- Europe > Austria > Vienna (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- North America > United States (0.04)
- Europe > Germany (0.04)
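A toy sketch of the kind of multimodal feature extraction M2R2 targets: per-timestep features from a proprioceptive stream and an exteroceptive (vision) stream are combined into one representation a downstream TAS model can consume. Plain concatenation stands in for the paper's learned fusion, so treat this purely as an illustration of the data flow.

```python
def fuse(proprio, vision):
    """Concatenate time-aligned per-timestep feature sequences
    from two sensor modalities into a single fused sequence."""
    assert len(proprio) == len(vision), "streams must be time-aligned"
    return [p + v for p, v in zip(proprio, vision)]

# Two timesteps: 2-dim proprioceptive features, 1-dim visual features.
fused = fuse([[0.1, 0.2], [0.3, 0.4]], [[1.0], [2.0]])
```

Because the fused features are produced by a standalone extractor rather than inside a TAS model, they can be cached once and reused across different segmentation models, which is the reuse property the abstract emphasizes.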
Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation
Wanyan, Yuyang, Yang, Xiaoshan, Dong, Weiming, Xu, Changsheng
Abstract--In this paper, we study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). The multimodal nature of videos introduces unique challenges, necessitating the simultaneous consideration of both domain alignment and modality collaboration in a few-shot scenario, which is ignored in previous literature. We observe that, under the influence of domain shift, the generalization performance on the target domain of each individual modality, as well as that of fused multimodal features, is constrained. This is because each modality comprises coupled features with multiple components that exhibit different domain shifts. This variability increases the complexity of domain adaptation, thereby reducing the effectiveness of multimodal feature integration. To address these challenges, we introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) to decompose, from each modality, modality-unique and modality-shared features with different domain-shift levels that are more amenable to domain alignment. The MC-LRD comprises multiple decomposers for each modality and Multimodal Decomposition Routers (MDR). Each decomposer has progressively shared parameters across different modalities. The MDR is leveraged to selectively activate the decomposers to produce modality-unique and modality-shared features. To ensure efficient decomposition, we apply orthogonal decorrelation constraints separately to decomposers and sub-routers, enhancing their diversity. Furthermore, we propose a cross-domain activation consistency loss to guarantee that target and source samples of the same category exhibit consistent activation preferences over the decomposers, thereby facilitating domain alignment. Extensive experimental results on three public benchmarks demonstrate that our model achieves significant improvements over existing methods.
- Europe > Switzerland > Basel-City > Basel (0.05)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > Jordan (0.04)
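A small sketch of an orthogonal decorrelation penalty of the kind MC-LRD applies to its decomposers: push the Gram matrix of the weight rows toward the identity so different decomposers stay diverse. Pure Python on list-of-lists matrices; the exact constraint in the paper may differ, so this is only the standard textbook form.

```python
def ortho_penalty(W):
    """Squared Frobenius norm of (W @ W.T - I): zero iff the rows of W
    are orthonormal, growing as rows become correlated."""
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(W[i][k] * W[j][k] for k in range(len(W[0])))
            target = 1.0 if i == j else 0.0
            total += (dot - target) ** 2
    return total

orthonormal_rows = [[1.0, 0.0], [0.0, 1.0]]   # no penalty
correlated_rows = [[1.0, 0.0], [1.0, 0.0]]    # identical rows, penalized
```

Adding such a term to the training loss (here for decomposers, and separately for sub-routers) is what encourages the decomposed components to capture distinct factors.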
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Hori, Chiori, Masuyama, Yoshiki, Jain, Siddarth, Corcodel, Radu, Jha, Devesh, Romeres, Diego, Roux, Jonathan Le
Abstract--Human-robot collaboration towards a shared goal requires robots to understand human action and interaction with the surrounding environment. This paper focuses on human-robot interaction (HRI) based on human-robot dialogue that relies on robot action confirmation and action step generation using multimodal scene understanding. The state-of-the-art approach uses multimodal transformers to generate robot action steps aligned with robot action confirmation from a single clip showing a task composed of multiple micro steps. Although actions towards a long-horizon task depend on each other throughout an entire video, current approaches mainly focus on clip-level processing and do not leverage long-context information. This paper proposes a long-context Q-Former incorporating left and right context dependency in full videos. Furthermore, this paper proposes a text-conditioning approach that feeds text embeddings directly into the LLM decoder to mitigate the Q-Former's over-abstraction of textual information. Experiments with the YouCook2 corpus show that the accuracy of confirmation generation is a major factor in the performance of action planning. Furthermore, we demonstrate that the long-context Q-Former improves confirmation and action planning by integrating VideoLLaMA3.
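The left/right context idea can be pictured as a simple windowing step: for each clip in a full video, gather features from its neighbors so the generator can condition on them rather than on the clip alone. The window sizes and names below are illustrative assumptions, not the paper's actual mechanism.

```python
def context_window(clip_feats, idx, left=2, right=2):
    """For clip `idx`, return (left-context clips, current clip,
    right-context clips), clamped at the video boundaries."""
    lo = max(0, idx - left)
    hi = min(len(clip_feats), idx + right + 1)
    return clip_feats[lo:idx], clip_feats[idx], clip_feats[idx + 1:hi]

# Five clip-level feature tokens from one video.
feats = ["c0", "c1", "c2", "c3", "c4"]
lctx, cur, rctx = context_window(feats, 2, left=2, right=1)
```

In the actual model a Q-Former would attend over these neighboring clip features; the point of the sketch is only that each clip's representation is built from the full-video neighborhood, not the clip in isolation.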
UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
In this work, we focus on a new and practical task of data pre-selection for data-efficient visual object recognition (Fig.1-a). The goal of data pre-selection is to select instances for labeling from an unlabeled dataset through a single pass to maximize model performance for unknown downstream vision tasks (e.g., no knowledge about
- North America > Canada (0.04)
- Asia > Japan (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Africa > Senegal > Kolda Region > Kolda (0.04)
PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Tan, Bin, Ge, Wangyao, Wang, Yidi, Liu, Xin, Burtoft, Jeff, Fan, Hao, Wang, Hui
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
- Asia > Middle East > UAE > Dubai Emirate > Dubai (0.05)
- Asia > China (0.05)
- North America > United States (0.04)
- Information Technology (0.46)
- Leisure & Entertainment (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.66)
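The parallel-versus-residual distinction PCR-CA draws can be illustrated with a toy quantizer: each codebook independently quantizes the same embedding (in residual quantization, by contrast, codebooks apply sequentially to residuals). Codebook contents, names, and the aspect labels below are invented for illustration.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))

def parallel_quantize(vec, codebooks):
    """Quantize vec against every codebook independently,
    yielding one discrete code per codebook."""
    return [nearest(vec, cb) for cb in codebooks]

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # hypothetical "gameplay" aspect
    [[0.0, 1.0], [1.0, 0.0]],  # hypothetical "art style" aspect
]
codes = parallel_quantize([0.9, 0.8], codebooks)
```

Because each codebook sees the full embedding rather than a residual, the discrete codes can encode independent aspects of an app, which is what lets a multi-category app carry several semantic codes at once.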